Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis

Neonatal necrotizing enterocolitis (NEC) occurs worldwide and is a major source of neonatal morbidity and mortality. Researchers have developed many methods for predicting NEC diagnosis and prognosis. However, most people use statistical methods to select features, which may ignore the correlation between features. In addition, because they consider a small dimension of characteristics, they neglect some laboratory parameters such as white blood cell count, lymphocyte percentage, and mean platelet volume, which could be potentially influential factors affecting the diagnosis and prognosis of NEC. To address these issues, we include more perinatal, clinical, and laboratory information, including anemia—red blood cell transfusion and feeding strategies, and propose a ridge regression and Q-learning strategy based bee swarm optimization (RQBSO) metaheuristic algorithm for predicting NEC diagnosis and prognosis. Finally, a linear support vector machine (linear SVM), which specializes in classifying high-dimensional features, is used as a classifier. In the NEC diagnostic prediction experiment, the area under the receiver operating characteristic curve (AUROC) of dataset 1 (feeding intolerance + NEC) reaches 94.23%. In the NEC prognostic prediction experiment, the AUROC of dataset 2 (medical NEC + surgical NEC) reaches 91.88%. Additionally, the classification accuracy of the RQBSO algorithm on the NEC dataset is higher than the other feature selection algorithms. Thus, the proposed approach has the potential to identify predictors that contribute to the diagnosis of NEC and stratification of disease severity in a clinical setting.


Introduction
Necrotizing enterocolitis (NEC) is one of the most devastating gastrointestinal diseases in the neonatal intensive care unit (NICU), with significant morbidity and mortality [1]. It is estimated that the incidence of NEC has been maintained at 3%-15%, and the mortality rate has been maintained at 20%-30% for decades [2,3]. In general, the diagnosis of NEC is based on a combination of clinical, laboratory, and radiographic symptoms, most of which are nonspecific or even insidious [4,5], such as abdominal distention and reduced bowel sounds as clinical indications for feeding intolerance (FI) and NEC. These insensitive features hinder timely diagnosis and accurate treatment. Due to the difficulty of early diagnosis of NEC and the lack of reliable biomarkers, it is essential to develop an effective diagnostic model of NEC to quickly and accurately identify the key information affecting the diagnosis and prognosis of NEC, leading to more timely treatment. NEC can be a rapidly progressing disease, and it may take only one to two days to progress from initial symptoms to full-blown illness and death. The severity of the disease is usually divided into "medical NEC" and" surgical NEC". Medical NEC refers only to medical management, while surgical NEC involves surgical intervention. In addition, as the disease progresses, the child's symptoms become more pronounced and the risk of long-term complications increases significantly, including neurocognitive impairment, developmental failure, short bowel syndrome, and cholestasis [6][7][8]. Therefore, it is necessary to identify high-risk infants before the disease progresses rapidly to ensure that therapeutic interventions can be initiated as soon as possible before bowel resection is required.
In recent years, machine learning (ML) methods have been widely used to diagnose cancer [9][10][11] and the other common diseases [12,13]. Many researchers have developed prediction models for early NEC diagnosis (suspected NEC + NEC) and graded NEC diagnosis (medical NEC + surgical NEC). In the feature selection stage, they use statistical analysis to extract important features. In the classification stage, most researchers use ML methods such as linear discriminant analysis (LDA) [14,15], random forest (RF) [16][17][18], or Light Gradient Boosting Machine (GBM) [19] as classifier models. Table 1 summarizes some studies using ML for the diagnosis or prognosis of NEC.
Most relevant studies perform well in the diagnosis and prognosis prediction of NEC. However, some key issues also need to be addressed. Firstly, most researchers use statistical methods to select features, which may ignore the correlation between features. Specifically, the idea of statistical methods is to use statistical significance to explore the association between each feature and category labels. Since there may be potential correlations between features, it is crucial to consider the correlation between features to ensure that the best performing subset of features is selected. Secondly, most researchers select a small number of features, which may overlook features that are highly correlated with predicted outcomes. Therefore, in order to solve the above problems, we need to include more features while considering their correlation. Feature selection is a fundamental task in machine learning and statistics, and has been proved to be an effective method to process feature-related data in previous studies [20,21]. Feature selection methods fall into three categories: filter methods, wrapper methods, and embedded methods. Filter methods [22][23][24][25] extract a subset of features from the initial dataset and use the correlation score for each feature created based on statistical measures to filter features. The advantage of this method is that the calculation is relatively easy and efficient. However, filter methods only rank features by their single-feature association with class labels and thus tend to ignore correlations between features [26]. Wrapper methods [27,28] integrate the classification algorithm into the feature selection process. Because wrappers directly optimize the target classification algorithm, they often achieve better classification performance than filters. Wrappers usually run much slower than filter methods due to their consideration of inter-feature relationships [29]. Embedded methods [30][31][32][33] use a classification learning algorithm to evaluate the validity of features, which retain the high precision of the wrapper methods and have the high efficiency of filter methods. However, the time complexity is relatively high when processing high-dimensional data, and the redundant features cannot be completely removed [34].
To address the above issues, various works are proposed to solve feature selection problems using metaheuristics [35]. Most of them use genetic algorithms (GA) [36][37][38][39]. Meta-heuristic algorithms based on swarm intelligence are also applied to feature selection, such as ant colony optimization (ACO) [40,41], particle swarm optimization (PSO) [42,43], and bee swarm optimization (BSO) [44,45]. Although metaheuristic algorithms are very effective in solving feature selection problems, the increasing number of features makes this task more and more difficult. Therefore, metaheuristic algorithms combined with machine learning and the other areas of approaches may achieve better results [46,47].
In this paper, we propose a novel algorithm called ridge regression and Q-learning strategy based bee swarm optimization (RQBSO) metaheuristic algorithm to predict NEC diagnosis and prognosis. Ridge regression is an embedded feature selection method. Compared with the other feature selection methods, the ridge regression algorithm can filter out irrelevant features while considering the correlation between features. Therefore, the ridge regression algorithm will help to screen irrelevant variables and improve the efficiency of the meta-heuristic algorithm search. To obtain the optimal feature subset, a Q-learning strategy based bee swarm optimization (QBSO) metaheuristic algorithm is used. The advantage of Q-learning is that it does not require a complete model of the fundamental problem, because learning is performed by gathering experience referred to as trial-error [48]. By combining Q-learning with the BSO algorithm, the BSO algorithm can be adaptive in the process of searching feature subsets. In the classification stage, since the RQBSO method outputs sparse feature vectors, a linear SVM specialized in processing such data is used as the classifier model.

Datasets
Settings and patients. This retrospective observational study was conducted in the neonatal intensive care unit (NICU) of Jilin University First Hospital, China, from January 1, 2015 to October 30, 2021 in accordance with the Helsinki Declaration of the World Medical Association. The study is approved by the Institutional Review Board of Jilin University First Hospital (Ethics No.2021-042). Due to the nature of the study, the informed consent from the parents/guardians of the patients is waived.
The infants with the presentation of FI who underwent abdominal imaging were enrolled, and their medical records were collected. FI is defined as "the inability to digest enteral feedings presented as gastric residual volume of more than 50%, abdominal distension or emesis or both, and the disruption of the patient's feeding plan" [49]. The exclusion criteria are as follows: (a) congenital malformations, (b) spontaneous bowel perforation, (c) emergency surgical conditions unrelated to NEC, and (d) incomplete information.
Data collection and definitions. The collected NEC and FI datasets include clinical patient information obtained between diagnosis and discharge from the NICU. The final diagnosis is determined by two independent senior neonatologists from an examination of the complete medical chart, including all perinatal and clinical findings, such as clinical manifestations, laboratory tests, the results of imaging, and the disease course. In case of disagreement between the two neonatologists, a consensus is reached with the help of a senior expert. We judge whether the infant experienced NEC based on modified Bell stage �IIA and then determine that the following criteria should be met in the whole disease course: (1) the presentation of FI; (2) abdominal signs (such as bowel sound attenuation and abdominal tenderness) and systemic signs (such as apnoea, lethargy, and temperature instability); and (3) antibiotics therapy and withholding feeds for at least one week [2,50].
The NEC group is further divided into a "medical NEC group" and a "surgical NEC group". Medical NEC involves only medical management, including withholding feeds, provision of parenteral nutrition, and empirical use of antibiotics, while surgical NEC involves surgical interventions, including laparotomy and peritoneal drainage. To avoid selection bias, infants who die from severe NEC disease are assigned to the surgical NEC group. Timing of NEC onset (t0) is defined as the earliest occurrence of one of the following, within 48 hours of confirmation: 1) first notification of abdominal problems by the neonatologist, 2) abdominal radiographs or abdominal ultrasound ordered, 3) stopping enteral feeding, or 4) initiation of antibiotics [51,52]. To identify predictors of NEC diagnosis and disease severity, we evaluate perinatal, clinical, and laboratory variables including treatment details prior to clinical onset of NEC in detail. A detailed description of each variable is shown in the S1 Table.

Methods
In this paper, we propose a feature selection cascade framework to address NEC diagnosis and prognosis prediction. Fig 1 shows the flowchart of our experiments, which can be divided into three stages: data preprocessing, feature selection using the RQBSO algorithm, and model classification.
All experiments are performed in a computer equipped with Jupyter notebook 3.6.1, which contains 16 GB RAM and an i7-6700 CPU clocked at 3.40 GHZ. All analyses are performed using the Scikit-learn library for Python 3.7 and the Matplotlib visualization tool.
Data preprocessing. First, we count the missing data and exclude clinical parameters from the study if they are missing more than 30%. Then the remaining missing values are filled using the k-nearest neighbor method. K-nearest neighbor filling is based on the principle that missing values are estimated and filled by the eigenvalues of the k nearest neighboring samples.
Assuming that x ai (the i-th feature of the a-th sample) is a missing value, the samples that do not contain the missing value at the corresponding position will serve as providers of training information (neighbors). The reciprocal of the Euclidean distance between the a-th sample and the b-th sample is used as the weight in filling (Eq (1)).
o ab ¼ 1 = ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi P where k denotes the k-th feature of the sample. The estimates of missing values can be filled

PLOS ONE
Estimation of diagnosis and severity of necrotizing enterocolitis using a weighted averaging eigenvalue of the nearest neighbor samples (Eq (2)).
Obviously, the closer the sample is to the target structure, the smaller the Euclidean distance between the two, the larger the weight factor and the greater the contribution provided to the missing value padding. However, this method has a problem that it can only estimate continuous variables, but not discrete variables. To address this drawback, we extend the existing knearest neighbor algorithm for estimating discrete variables by voting based on the k-nearest neighbor samples and using the nearest neighbor sample category with the most votes to fill in the missing values, as shown in Eq (3).
where K is the set of all k nearest neighbor samples. By adopting a hybrid strategy missing value filling method, we not only make effective use of the existing information, but also extend the application of the k-nearest neighbor feature filling method so that it can be used to fill both discrete and continuous variables.
We normalize the raw NEC data by the z-score algorithm [53] to eliminate the effects of inter-feature variation in magnitude and distribution. In addition, the normalized data can improve the convergence speed and prediction accuracy of the ML model.
Feature selection using RQBSO algorithm. RQBSO framework is a feature selection algorithm for the diagnosis and prognosis of NEC. It combines a ridge regression algorithm and Q-learning strategy based BSO metaheuristic algorithm. Unlike BSO, it can filter out irrelevant features by ridge regression technique, so there is no need to traverse all features in the search process of the BSO algorithm. Therefore, compared with BSO, the RQBSO algorithm has a faster training speed. Figs 2 and 3 show the structure and pseudocode of the RQBSO algorithm, respectively.
In the first stage of RQBSO, we collapse these NEC data vectors in the data input layer into a NEC data matrix suitable for processing by the feature selection algorithm. Eq (4) shows the process as ð4Þ where x ij denotes the j-th feature of the i-th sample. The matrix X is fed to the next stage for features prescreening.
In the second stage of RQBSO, ridge regression is used for preliminary screening of features. The purpose is to filter out irrelevant features, reduce the space of the feature search, and improve search efficiency. The optimization objective of Ridge is where x i ! denotes the i-th sample, y i denotes the i-th label, and the regularization parameter λ determines the compression degree of model coefficients. We use cross-validation to determine the appropriate λ value. To solve for the regression coefficient b ! , we take the partial derivative of b ! with respect to Eq (5), as shown in Eq (6) can be obtained (as shown in Eq (7) where I denotes identity matrix. The explainable model is obtained by filtering out the features with regression coefficients equal to zero, and the purpose of feature screening is achieved.
In the final stage of RQBSO, the features that flow into the next stage are further filtered using the QBSO feature selection method to obtain the optimal subset of features. The QBSO method can be roughly divided into three stages: the determination of the search area, the local search of the Q-learning strategy, and the determination of the optimal feature subset.
The determination of the search area. In the first iteration, 20% of features are randomly generated as the initial feature set, which is used as the initial reference solution Refsol. To obtain the feature subset of the search area, we use two different strategies to ensure that the feature subset obtained is as different as possible. In the first strategy, the k-th feature subset is generated by flipping starting from the k-th bit of RefSol with a flipping interval of n/flip bits. Here, flip is a hyper-parameter, and the size of this set is equal to the number of bees, which determines the number of features to filter from Refsol. As an example, let n = 20 and filp = 5, where n denotes the number of features in Refsol. If the index of features are 0 to 19, feature subsets f 0 , f 1 , f 2 , f 3 and f 4 are obtained by flipping the following bits, as shown in Fig 4: (0,5,10,15), (1,6,11,16), (2,7,12,17), (3,8,13,18) and (4,9,14,19). In the second strategy, the k-th subset of features is obtained by flipping n/flip contiguous bits starting from the k-th bit. Following the previous example, the feature subsets f 0 , f 1 , f 2 , f 3 and f 4 are obtained by flipping the following bits: (0,1,2,3), (4,5,6,7), (8,9,10,11), (12,13,14,15) and (16,17,18,19). With the above two searching strategies, we determine the search area for each bee.
The local search of the Q-learning strategy. After determining the search area of the feature sets, we perform a nearest neighbor search for each bee by flipping each bit of the feature set separately. We denote the action of flipping the current feature as a t (a t 2A t ) and the state as . .s t+n }, and denote a t+1 as the action at the next moment (flipping the next feature) and s t+1 as the state resulting from that action. By comparing the state at that moment with the next moment, we can obtain the reward r t when searching for a subset of features in different neighborhoods, as shown in Eq (8).
( where Acc(s t ) denotes the classification accuracy of the selected feature subset in the current state, and Acc(s t+1 ) denotes the classification accuracy of the selected feature subset in the next state. If the accuracy of selecting a subset of features in the current state is equal to that in the next state, then the reward r t is calculated by comparing the number of selected features in the two states, as shown in Eq (9).  (10).
where 0�γ�1 denotes the discount parameter, Q(s t+1 , a t+1 ) denotes the Q value under the next state and action, Q(s t , a t ) denotes the Q value under the current state and action, and the initial value of Q value is zero. By comparing the Q values under each feature subset, we select the feature subset with the largest Q value as the initial solution for each bee's next search, and continuously update the feature subset. The initial solution until a predetermined number of iterations is reached (localInteration), and finally return an optimal solution as the result of that bee's search.
The determination of the optimal feature subset. After determining the optimal solution for each bee's search, we compare its Q-value and return the feature subset corresponding to the largest Q-value as the reference solution for the next iteration. Then the (1) and (2) are repeated until a predefined number of iterations (MaxInteration) is reached. Finally, the feature set with the maximum Q-value is returned as the optimal feature subset. If the maximum Q value determined in this iteration is less than the maximum Q value of the previous iteration, we perform the diversification operation in the next iteration, that is, re-select 20% of the features at random as the solution for the next iteration, and then perform the (1) and (2) processes to determine the maximum Q value and continue the comparison with the current maximum Q value.
The advantage of the QBSO algorithm is that it processes learning through interactions with the environment. At the same time, the Q-learning adaptive searching method is used to avoid the problem of falling into local optimality.
Model classification. To evaluate the performance of the feature selection algorithm, we use a supervised classification model called linear SVM to calculate the classification accuracy. The linear SVM classifier is a popular supervised learning algorithm. It uses the computed decision hyperplane to classify samples. The choice of the error penalty factor, which represents the error tolerance, significantly affects the accuracy of the linear SVM. In our experiments, we use an SVM with a linear kernel function [54] and the parameter C set to 1.

Performance measurements
To obtain a highly robust model, we use ten-fold cross-validation in our experiments. Specifically, we randomly divide the experimental data into 10 equal parts. In each experiment, 9 copies of the data are selected in turn for training and the remaining data are tested. We take the average of the 10 results as an estimate of the model accuracy.
A binary classification algorithm optimizes the parameters of a model and predicts that a new sample belongs to the positive (P) or negative (N) group. The sizes of the positive and negative groups are respectively denoted as P and N. A positive sample is defined as a true positive or false negative if it is predicted as positive or negative. A negative sample is defined as a false positive or a true negative if its prediction is positive or negative. The numbers of true positives, false negatives, false positives, and true negatives are denoted as TP, FN, FP, and TN, respectively. The binary classification performance is evaluated by the following measurements, as [55]. This study defines recall (Rec) as the percentages of correctly predicted positive samples, i.e. Rec = TP/(TP+FN). The overall accuracy is defined as Acc = (TP+TN)/(TP+FN +TN+FP). F1-score is also known as F-measure or F-score and has been widely used to evaluate the performance of a binary classification model [56]. F1-score is defined as 2 � (Precision � Rec)/(Precision+Rec), and precision is defined as Pre = TP/(TP+FP). In addition, ROC and PRC curves reflect the relationship between true positive rates and false positive rates, precision and recall, respectively. They are often used as performance graphing methods in medical decision-making [57].

Comparison with other feature selection algorithms
We evaluate our proposed feature selection algorithm RQBSO and compare it with three major groups of feature selection methods, including two filter methods, namely Max-Relevance and Min-Redundancy (mRMR) [58] and ReliefF [59]. Three wrapper methods, namely GA [39], BSO [44], and recursive feature elimination (RFE) [60]. Our method is also compared with two leading embedded methods, namely LASSO [30], and Ridge regression [61]. The important parameter settings of the RQBSO algorithm are shown in Table 3. The hyperparameters of other methods are detailed in S2 Table. Fig 5A-5D and Table 4 show the comparison of the RQBSO algorithm with three sets of feature selection methods using ten-fold cross-validation. As shown in Fig 5A and 5B, RQBSO (orange curve) outperforms the other algorithms with AUROC values of 94.20% and 91.85% on both datasets. For the same FPR level, both our method obtains a higher TPR value, which is of great significance for the diagnosis and prognosis of NEC. The AUROC values of the two filter methods (mRMR and reliefF) perform poorly due to the failure to consider the correlation between features. The PRC curves in Fig 5C and 5D also confirms these results. RQBSO has the highest AUPRC values on both datasets with 97.42% and 84.61%, respectively. Table 4 shows that the classification accuracy of the NEC diagnosis and prognosis datasets are 91.07% and 84.37%, respectively. The advantage of our experimental accuracy is significant. In terms of accuracy, the prediction success rate of the RQBSO method exceeded 93% by conducting experiments in both dataset 1 and dataset 2. Compared with the other feature selection algorithms, our accuracy and precision are at a high level.

Feature importance analysis
We apply the RQBSO feature selection algorithm on dataset 1 and 2 to select the optimal feature set and calculate the final ranking of the selected features. The normalized importance scores of the selected features are presented in Tables 5 and 6.
In the differential diagnosis of NEC, placenta abnormalities, platelet distribution width (PDW) at birth are the two most important features. This is followed by type of milk, lymphocyte percentage (LY%) change, signs of peritoneal irritation, achieved full enteral feeding, and drowsiness (Table 5). Overall, perinatal features account for 7.96% of the differential diagnosis of NEC, clinical features before clinical onset account for 28.84%, clinical features at clinical onset account for 25.04%, and laboratory parameters account for 38.16%.
In the classification of NEC, anemia-RBC transfusion, signs of peritoneal irritation, acidosis, tachycardia, and white blood cell count (WBC) change are the top five most important features (Table 6). Overall, perinatal features account for 9.17% of NEC classification, clinical features before clinical onset account for 27.85%, clinical features at clinical onset account for 28.06%, and laboratory parameters account for 34.92%.

Comparison with other ML classifiers
In addition to linear SVM, we evaluate three representative classification algorithms on the dataset. The k-nearest neighbor (KNN) algorithm is a distance-based metric. The multi-layer perceptron (MLP) algorithm is one of the most widely used neural network models, and the algorithm is a multilayer feedforward neural network. The random forest (RF) algorithm is an integrated learning algorithm consisting of multiple decision trees.  We use four classifiers to classify the NEC dataset. Compared with KNN, MLP and RF methods, the linear SVM has faster training and classification speed because the linear SVM is a linear classifier well suited for high-dimensional features, and it also has good generalization ability. As shown in Fig 6, the linear SVM has AUROC values of 94.22% and 91.85% and AUPRC values of 97.43% and 85.36% in datasets 1 and 2. In contrast, the AUROC and AUPRC values of KNN, MLP and RF are lower than those of the linear SVM. Therefore, the linear SVM has a significant advantage over the other three classifiers in our experiments.

Predictive features
This study builds and tests a feature selection and classification algorithm that uses available data prior to disease onset for automatic diagnostic classification and NEC risk prediction. Using different ML-based classifiers trained and tested on different datasets, we obtain two general models with high accuracy and precision. In our multivariate feature selection algorithm, the previously described NEC parameters of higher WBC, signs of peritoneal irritation, and early clinical onset of NEC are significant weighted predictors of surgical NEC. Higher neutrophil percentage (NEUT%) at clinical onset, breast milk, and the use of probiotics are significant weighted predictors to identify classic NEC [7,14,62]. In addition, we also identify mean corpuscular hemoglobin (MCH) at clinical onset and anemia-RBC transfusion, which are risk factors for the development of NEC [63][64][65], as the weighted predictors of surgical NEC. This suggests that our feature selection method identifies pathophysiologically important predictors of NEC diagnosis and prognosis. Previously unreported key variables predicting NEC, such as some parameters in routine blood tests and their variations, should be brought to the attention of clinicians.

Strengths and limitations
One of the strengths of this study is the extensive collection of perinatal, clinical, and laboratory information, including topical issues in NEC in recent years, such as anemia-RBC transfusion and feeding strategies, which allows for a detailed assessment to predict the diagnosis and severity of NEC. In addition, we propose the RQBSO feature selection algorithm, which uses an integrated learning strategy that combines machine learning with a swarm optimization  algorithm. This algorithm achieves better feature selection results on both the NEC diagnosis and risk prediction datasets. The average classification accuracy of RQBSO filtered features is higher in both datasets. Moreover, most of the features filtered by RQBSO are clinically significant, and these important weighted predictive values deserve the attention of clinicians. The present study has some limitations. Firstly, the number of extracted features is disproportionate to the size of the dataset, which may affect the performance of our ML classifiers, and increasing the sample size would probably improve performance. Secondly, the Bell staging criteria used in this study provide a relatively poor description of bowel injury. Although we exclude possible confounding factors when separating medical NEC from surgical NEC, applying ML methods to classify datasets with poorly defined non-discrete entities may be flawed. Finally, the lack of out-of-sample validation and the single-center retrospective design make our models less applicable. We hope to validate our models with future data from our NICU or other NICUs.

Conclusion
In this work, we propose a new feature selection framework RQBSO for early diagnosis of NEC and identification of high-risk infants. To evaluate the effectiveness of our algorithms, we conduct experiments on two skewed datasets of NEC differential diagnosis and risk prediction. In the end, we classify the NEC differential diagnosis data with an average recognition accuracy of 91.07% and an AUROC value of 94.20%. While the accuracy of the other set is only 84.37%, and the AUROC value is 91.85%. The experimental results show that the method has a high recognition accuracy in the differential diagnosis and risk prediction of NEC. In addition, the method screens some new significant weighted predictors that may lead to earlier identification and more timely treatment.
In future work, we plan to apply our method to higher-dimensional datasets and perform deeper parameter tuning to investigate their impact on algorithm performance.